Search | VHL Regional Portal

1.

FAIR-USE4OS: Guidelines for creating impactful open-source software.

Sonabend, Raphael; Gruson, Hugo; Wolansky, Leo; Kiragga, Agnes; Katz, Daniel S.

PLoS Comput Biol ; 20(5): e1012045, 2024 May.

Article in English | MEDLINE | ID: mdl-38722873

ABSTRACT

This paper extends the FAIR (Findable, Accessible, Interoperable, Reusable) guidelines to provide criteria for assessing if software conforms to best practices in open source. By adding "USE" (User-Centered, Sustainable, Equitable), software development can adhere to open source best practice by incorporating user-input early on, ensuring front-end designs are accessible to all possible stakeholders, and planning long-term sustainability alongside software design. The FAIR-USE4OS guidelines will allow funders and researchers to more effectively evaluate and plan open-source software projects. There is good evidence of funders increasingly mandating that all funded research software is open source; however, even under the FAIR guidelines, this could simply mean software released on public repositories with a Zenodo DOI. By creating FAIR-USE software, best practice can be demonstrated from the very beginning of the design process and the software has the greatest chance of success by being impactful.

Subject(s)

Guidelines as Topic , Software , Computational Biology/methods , Software Design , Humans

2.

Special issue on software citation, indexing, and discoverability.

Katz, Daniel S; Chue Hong, Neil P.

PeerJ Comput Sci ; 10: e1951, 2024.

Article in English | MEDLINE | ID: mdl-38660149

ABSTRACT

Software plays a fundamental role in research as a tool, an output, or even as an object of study. This special issue on software citation, indexing, and discoverability brings together five papers examining different aspects of how the use of software is recorded and made available to others. It describes new work on datasets that enable large-scale analysis of the evolution of software usage and citation, that presents evidence of increased citation rates when software artifacts are released, that provides guidance for registries and repositories to support software citation and findability, and that shows there are still barriers to improving and formalising software citation and publication practice. As the use of software increases further, driven by modern research methods, addressing the barriers to software citation and discoverability will encourage greater sharing and reuse of software, in turn enabling research progress.

3.

Group authorship, an excellent opportunity laced with ethical, legal and technical challenges.

Hosseini, Mohammad; Holcombe, Alex O; Kovacs, Marton; Zwart, Hub; Katz, Daniel S; Holmes, Kristi.

Account Res ; : 1-23, 2024 Mar 06.

Article in English | MEDLINE | ID: mdl-38445637

ABSTRACT

Group authorship (also known as corporate authorship, team authorship, consortium authorship) refers to attribution practices that use the name of a collective (be it team, group, project, corporation, or consortium) in the authorship byline. Data shows that group authorships are on the rise but thus far, in scholarly discussions about authorship, they have not gained much specific attention. Group authorship can minimize tensions within the group about authorship order and the criteria used for inclusion/exclusion of individual authors. However, current use of group authorships has drawbacks, such as ethical challenges associated with the attribution of credit and responsibilities, legal challenges regarding how copyrights are handled, and technical challenges related to the lack of persistent identifiers (PIDs), such as ORCID, for groups. We offer two recommendations: 1) Journals should develop and share context-specific and unambiguous guidelines for group authorship, for which they can use the four baseline requirements offered in this paper; 2) Using persistent identifiers for groups and consistent reporting of members' contributions should be facilitated through devising PIDs for groups and linking these to the ORCIDs of their individual contributors and the Digital Object Identifier (DOI) of the published item.

4.

Journal Production Guidance for Software and Data Citations.

Stall, Shelley; Bilder, Geoffrey; Cannon, Matthew; Chue Hong, Neil; Edmunds, Scott; Erdmann, Christopher C; Evans, Michael; Farmer, Rosemary; Feeney, Patricia; Friedman, Michael; Giampoala, Matthew; Hanson, R Brooks; Harrison, Melissa; Karaiskos, Dimitris; Katz, Daniel S; Letizia, Viviana; Lizzi, Vincent; MacCallum, Catriona; Muench, August; Perry, Kate; Ratner, Howard; Schindler, Uwe; Sedora, Brian; Stockhause, Martina; Townsend, Randy; Yeston, Jake; Clark, Timothy.

Sci Data ; 10(1): 656, 2023 09 26.

Article in English | MEDLINE | ID: mdl-37752153

5.

FAIR for AI: An interdisciplinary and international community building perspective.

Huerta, E A; Blaiszik, Ben; Brinson, L Catherine; Bouchard, Kristofer E; Diaz, Daniel; Doglioni, Caterina; Duarte, Javier M; Emani, Murali; Foster, Ian; Fox, Geoffrey; Harris, Philip; Heinrich, Lukas; Jha, Shantenu; Katz, Daniel S; Kindratenko, Volodymyr; Kirkpatrick, Christine R; Lassila-Perini, Kati; Madduri, Ravi K; Neubauer, Mark S; Psomopoulos, Fotis E; Roy, Avik; Rübel, Oliver; Zhao, Zhizhen; Zhu, Ruike.

Sci Data ; 10(1): 487, 2023 07 26.

Article in English | MEDLINE | ID: mdl-37495591

6.

Policy recommendations to ensure that research software is openly accessible and reusable.

McKiernan, Erin C; Barba, Lorena; Bourne, Philip E; Carter, Caitlin; Chandler, Zach; Choudhury, Sayeed; Jacobs, Stephen; Katz, Daniel S; Lieggi, Stefanie; Plale, Beth; Tananbaum, Greg.

PLoS Biol ; 21(7): e3002204, 2023 07.

Article in English | MEDLINE | ID: mdl-37478129

ABSTRACT

Research data is optimized when it can be freely accessed and reused. To maximize research equity, transparency, and reproducibility, policymakers should take concrete steps to ensure that research software is openly accessible and reusable.

Subject(s)

Policy , Software , Reproducibility of Results

7.

Volunteer-contributed observations of flowering often correlate with airborne pollen concentrations.

Crimmins, Theresa M; Vogt, Elizabeth; Brown, Claudia L; Dalan, Dan; Manangan, Arie; Robinson, Guy; Song, Yiluan; Zhu, Kai; Katz, Daniel S W.

Int J Biometeorol ; 67(8): 1363-1372, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37330426

ABSTRACT

Characterizing airborne pollen concentrations is crucial for supporting allergy and asthma management; however, pollen monitoring is labor intensive and, in the USA, geographically limited. The USA National Phenology Network (USA-NPN) engages thousands of volunteer observers in regularly documenting the developmental and reproductive status of plants. The reports of flower and pollen cone status contributed to the USA-NPN's platform, Nature's Notebook, have the potential to help address gaps in pollen monitoring by providing real-time, spatially explicit information from across the country. In this study, we assessed whether observations of flower and pollen cone status contributed to Nature's Notebook can serve as effective proxies for airborne pollen concentrations. We compared daily pollen concentrations from 36 National Allergy Bureau (NAB) stations in the USA with flowering and pollen cone status observations collected within 200 km of each NAB station in each year, 2009-2021, for 15 common tree taxa using Spearman's correlations. Of 350 comparisons, 58% of correlations were significant (p < 0.05). Comparisons could be made at the largest numbers of sites for Acer and Quercus. Quercus demonstrated a comparatively high proportion of tests with significant agreement (median ρ = 0.49). Juglans demonstrated the strongest overall coherence between the two datasets (median ρ = 0.79), though comparisons were made at only a small number of sites. For particular taxa, volunteer-contributed flowering status observations demonstrate promise to indicate seasonal patterns in airborne pollen concentrations. The quantity of observations, and therefore, their utility for supporting pollen alerts, could be substantially increased through a formal observation campaign.

Subject(s)

Hypersensitivity , Quercus , Humans , Allergens , Seasons , Environmental Monitoring , Pollen
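
The abstract above compares daily pollen concentrations with flowering status observations using Spearman's correlations. As a rough sketch of that statistic only: Spearman's ρ is the Pearson correlation of the ranks, with tied values given their average rank. The function names and the data values below are hypothetical illustrations, not taken from the study, and how the authors summarized the volunteer observations before correlating them is not stated here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily values: proportion of observations reporting open
# flowers, and the airborne pollen concentration on the same days.
flowering = [0.0, 0.1, 0.4, 0.7, 0.9, 0.5]
pollen = [2, 15, 60, 240, 180, 90]
rho = spearman_rho(flowering, pollen)
print(f"Spearman rho = {rho:.2f}")
```

Because the statistic depends only on ranks, it captures any monotonic relationship and is not distorted by the strongly skewed distribution typical of pollen counts.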

8.

Do upper respiratory viruses contribute to racial and ethnic disparities in emergency department visits for asthma?

Bhavnani, Darlene; Wilkinson, Matthew; Zárate, Rebecca A; Balcer-Whaley, Susan; Katz, Daniel S W; Rathouz, Paul J; Matsui, Elizabeth C.

J Allergy Clin Immunol ; 151(3): 778-782.e1, 2023 03.

Article in English | MEDLINE | ID: mdl-36400176

ABSTRACT

BACKGROUND: There are marked disparities in asthma-related emergency department (ED) visit rates among children by race and ethnicity. Following the implementation of coronavirus disease 2019 (COVID-19) prevention measures, asthma-related ED visit rates declined substantially. The decline has been attributed to the reduced circulation of upper respiratory viruses, a common trigger of asthma exacerbations in children. OBJECTIVES: To better understand the contribution of respiratory viruses to racial and ethnic disparities in ED visit rates, we investigated whether the reduction in ED visit rates affected Black, Latinx, and White children with asthma equally. METHODS: Asthma-related ED visits were extracted from electronic medical records at Dell Children's Medical Center in Travis County, Texas. ED visit rates among children with asthma were derived by race/ethnicity. Incidence rate ratios (IRRs) and 95% CIs were estimated by year (2019-2021) and season. RESULTS: In spring 2019, the ED visit IRRs comparing Black children with White children and Latinx children with White children were 6.67 (95% CI = 4.92-9.05) and 2.10 (95% CI = 1.57-2.80), respectively. In spring 2020, when infection prevention measures were implemented, the corresponding IRRs decreased to 1.73 (95% CI = 0.90-3.32) and 0.68 (95% CI = 0.38-1.23), respectively. CONCLUSIONS: The striking reduction of disparities in ED visits suggests that during nonpandemic periods, respiratory viruses contribute to the excess burden of asthma-related ED visits among Black and Latinx children with asthma. Although further investigation is needed to test this hypothesis, our findings raise the question of whether Black and Latinx children with asthma are more vulnerable to upper respiratory viral infections.

Subject(s)

Asthma , COVID-19 , Child , Humans , Emergency Service, Hospital , Asthma/epidemiology , Ethnicity , Texas
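
The abstract above reports incidence rate ratios (IRRs) with 95% confidence intervals. As a minimal sketch of how such a ratio can be computed, assuming Poisson-distributed event counts and a simple Wald interval on the log scale: the estimation method actually used in the study is not stated in the abstract, and the function name and the counts below are hypothetical.

```python
import math


def incidence_rate_ratio(events_a, persontime_a, events_b, persontime_b, z=1.96):
    """Return (IRR, lower, upper) comparing group A with reference group B.

    Assumes Poisson counts; the interval is a Wald interval on log(IRR),
    whose standard error is sqrt(1/events_a + 1/events_b).
    """
    rate_a = events_a / persontime_a
    rate_b = events_b / persontime_b
    irr = rate_a / rate_b
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper


# Hypothetical counts: 40 visits per 1,000 children in group A versus
# 20 visits per 1,000 children in the reference group B.
irr, lower, upper = incidence_rate_ratio(40, 1000, 20, 1000)
print(f"IRR = {irr:.2f} (95% CI = {lower:.2f}-{upper:.2f})")
```

An interval that includes 1, as for the spring 2020 estimates in the abstract, means the data are compatible with no difference in rates between the two groups.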

9.

Introducing the FAIR Principles for research software.

Barker, Michelle; Chue Hong, Neil P; Katz, Daniel S; Lamprecht, Anna-Lena; Martinez-Ortiz, Carlos; Psomopoulos, Fotis; Harrow, Jennifer; Castro, Leyla Jael; Gruenpeter, Morane; Martinez, Paula Andrea; Honeyman, Tom.

Sci Data ; 9(1): 622, 2022 10 14.

Article in English | MEDLINE | ID: mdl-36241754

ABSTRACT

Research software is a fundamental and vital part of research, yet significant challenges to discoverability, productivity, quality, reproducibility, and sustainability exist. Improving the practice of scholarship is a common goal of the open science, open source, and FAIR (Findable, Accessible, Interoperable and Reusable) communities and research software is now being understood as a type of digital object to which FAIR should be applied. This emergence reflects a maturation of the research community to better understand the crucial role of FAIR research software in maximising research value. The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide the basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.

10.

A survey of the state of the practice for research software in the United States.

Carver, Jeffrey C; Weber, Nic; Ram, Karthik; Gesing, Sandra; Katz, Daniel S.

PeerJ Comput Sci ; 8: e963, 2022.

Article in English | MEDLINE | ID: mdl-35634111

ABSTRACT

Research software is a critical component of contemporary scholarship. Yet, most research software is developed and managed in ways that are at odds with its long-term sustainability. This paper presents findings from a survey of 1,149 researchers, primarily from the United States, about sustainability challenges they face in developing and using research software. Some of our key findings include a repeated need for more opportunities and time for developers of research software to receive training. These training needs cross the software lifecycle and various types of tools. We also identified the recurring need for better models of funding research software and for providing credit to those who develop the software so they can advance in their careers. The results of this survey will help inform future infrastructure and service support for software developers and users, as well as national research policy aimed at increasing the sustainability of research software.

11.

A FAIR and AI-ready Higgs boson decay dataset.

Chen, Yifan; Huerta, E A; Duarte, Javier; Harris, Philip; Katz, Daniel S; Neubauer, Mark S; Diaz, Daniel; Mokhtar, Farouk; Kansal, Raghav; Park, Sang Eon; Kindratenko, Volodymyr V; Zhao, Zhizhen; Rusack, Roger.

Sci Data ; 9(1): 31, 2022 Feb 14.

Article in English | MEDLINE | ID: mdl-35165298

ABSTRACT

To enable the reusability of massive scientific datasets by humans and machines, researchers aim to adhere to the principles of findability, accessibility, interoperability, and reusability (FAIR) for data and artificial intelligence (AI) models. This article provides a domain-agnostic, step-by-step assessment guide to evaluate whether or not a given dataset meets these principles. We demonstrate how to use this guide to evaluate the FAIRness of an open simulated dataset produced by the CMS Collaboration at the CERN Large Hadron Collider. This dataset consists of Higgs boson decays and quark and gluon background, and is available through the CERN Open Data Portal. We use additional available tools to assess the FAIRness of this dataset, and incorporate feedback from members of the FAIR community to validate our results. This article is accompanied by a Jupyter notebook to visualize and explore this dataset. This study marks the first in a planned series of articles that will guide scientists in the creation of FAIR AI models and datasets in high energy particle physics.

12.

Within city spatiotemporal variation of pollen concentration in the city of Toronto, Canada.

Zapata-Marin, Sara; Schmidt, Alexandra M; Weichenthal, Scott; Katz, Daniel S W; Takaro, Tim; Brook, Jeffrey; Lavigne, Eric.

Environ Res ; 206: 112566, 2022 04 15.

Article in English | MEDLINE | ID: mdl-34922985

ABSTRACT

BACKGROUND: The exacerbation of asthma and respiratory allergies has been associated with exposure to aeroallergens such as pollen. Within an urban area, tree cover, level of urbanization, atmospheric conditions, and the number of source plants can influence spatiotemporal variations in outdoor pollen concentrations. OBJECTIVE: We analyze weekly pollen measurements made between March and October 2018 over 17 sites in Toronto, Canada. The main goals are: to estimate the concentration of different types of pollen across the season; estimate the association, if any, between pollen concentration and environmental variables, and provide a spatiotemporal surface of concentration of different types of pollen across the weeks in the studied period. METHODS: We propose an extension of the land-use regression model to account for the temporal variation of pollen levels and the high number of measurements equal to zero. Inference is performed under the Bayesian framework, and uncertainty of predicted values is naturally obtained through the posterior predictive distribution. RESULTS: Tree pollen was positively associated with commercial areas and tree cover, and negatively associated with grass cover. Both grass and weed pollen were positively associated with industrial areas and TC brightness and negatively associated with the northing coordinate. The total pollen was associated with a combination of these environmental factors. Predicted surfaces of pollen concentration are shown at some sampled weeks for all pollen types. SIGNIFICANCE: The predicted surfaces obtained here can help future epidemiological studies to find possible associations between pollen levels and some health outcome like respiratory allergies at different locations within the study area.

Subject(s)

Allergens , Pollen , Bayes Theorem , Cities , Environmental Monitoring , Poaceae , Seasons

13.

Software Training in HEP.

Malik, Sudhir; Meehan, Samuel; Lieret, Kilian; Oan Evans, Meirin; Villanueva, Michel H; Katz, Daniel S; Stewart, Graeme A; Elmer, Peter; Aziz, Sizar; Bellis, Matthew; Bianchi, Riccardo Maria; Bianco, Gianluca; Bonilla, Johan Sebastian; Burger, Angela; Burzynski, Jackson; Chamont, David; Feickert, Matthew; Gadow, Philipp; Gruber, Bernhard Manfred; Guest, Daniel; Hageboeck, Stephan; Heinrich, Lukas; Horzela, Maximilian M; Huwiler, Marc; Lange, Clemens; Lehmann, Konstantin; Li, Ke; Majumder, Devdatta; Mamuzic, Judita; Nelson, Kevin; Newhouse, Robin; Nibigira, Emery; Norberg, Scarlet; Pineda, Arturo Sánchez; Proffitt, Mason; Regnery, Brendan; Roepe, Amber; Roiser, Stefan; Schreiner, Henry; Shadura, Oksana; Stark, Giordon; Swatman, Stephen Nicholas; Thais, Savannah; Valassi, Andrea; Wunsch, Stefan; Yakobovitch, David; Yuan, Siqi.

Comput Softw Big Sci ; 5(1): 22, 2021.

Article in English | MEDLINE | ID: mdl-34642648

ABSTRACT

The long-term sustainability of the high-energy physics (HEP) research software ecosystem is essential to the field. With new facilities and upgrades coming online throughout the 2020s, this will only become increasingly important. Meeting the sustainability challenge requires a workforce with a combination of HEP domain knowledge and advanced software skills. The required software skills fall into three broad groups. The first is fundamental and generic software engineering (e.g., Unix, version control, C++, and continuous integration). The second is knowledge of domain-specific HEP packages and practices (e.g., the ROOT data format and analysis framework). The third is more advanced knowledge involving specialized techniques, including parallel programming, machine learning and data science tools, and techniques to maintain software projects at all scales. This paper discusses the collective software training program in HEP led by the HEP Software Foundation (HSF) and the Institute for Research and Innovation in Software in HEP (IRIS-HEP). The program equips participants with an array of software skills that serve as ingredients for the solution of HEP computing challenges. Beyond serving the community by ensuring that members are able to pursue research goals, the program serves individuals by providing intellectual capital and transferable skills important to careers in the realm of software and computing, inside or outside HEP.

14.

Erratum: Taking a fresh look at FAIR for research software.

Katz, Daniel S; Gruenpeter, Morane; Honeyman, Tom.

Patterns (N Y) ; 2(5): 100267, 2021 May 14.

Article in English | MEDLINE | ID: mdl-34027503

ABSTRACT

[This corrects the article DOI: 10.1016/j.patter.2021.100222.].

15.

Taking a fresh look at FAIR for research software.

Katz, Daniel S; Gruenpeter, Morane; Honeyman, Tom.

Patterns (N Y) ; 2(3): 100222, 2021 Mar 12.

Article in English | MEDLINE | ID: mdl-33748799

ABSTRACT

Software is increasingly essential in most research, and much of this software is developed specifically for and during research. To make this research software findable, accessible, interoperable, and reusable (FAIR), we need to define exactly what FAIR means for research software and acknowledge that software is a living and complex object for which it is impossible to propose one solution that fits all software.

16.

Pollen production for 13 urban North American tree species: Allometric equations for tree trunk diameter and crown area.

Katz, Daniel S W; Morris, Jonathan R; Batterman, Stuart A.

Aerobiologia (Bologna) ; 36(3): 401-415, 2020 Sep.

Article in English | MEDLINE | ID: mdl-33343061

ABSTRACT

Estimates of airborne pollen concentrations at the urban scale would be useful for epidemiologists, land managers, and allergy sufferers. Mechanistic models could be well suited for this task, but their development will require data on pollen production across cities, including estimates of pollen production by individual trees. In this study, we developed predictive models for pollen production as a function of trunk size, canopy area, and height, which are commonly recorded in tree surveys or readily extracted from remote sensing data. Pollen production was estimated by measuring the number of flowers per tree, the number of anthers per flower, and the number of pollen grains per anther. Variability at each morphological scale was assessed using bootstrapping. Pollen production was estimated for the following species: Acer negundo, Acer platanoides, Acer rubrum, Acer saccharinum, Betula papyrifera, Gleditsia triacanthos, Juglans nigra, Morus alba, Platanus x acerfolia, Populus deltoides, Quercus palustris, Quercus rubra, and Ulmus americana. Basal area predicted pollen production with a mean R² of 0.72 (range: 0.41 - 0.99), whereas canopy area predicted pollen production with a mean R² of 0.76 (range: 0.50 - 0.99). These equations are applied to two tree datasets to estimate total municipal pollen production and the spatial distribution of street tree pollen production for the focal species. We present some of the first individual-tree based estimates of pollen production at the municipal scale; the observed spatial heterogeneity in pollen production is substantial and can feasibly be included in mechanistic models of airborne pollen at fine spatial scales.

17.

The challenges of theory-software translation.

Jay, Caroline; Haines, Robert; Katz, Daniel S; Carver, Jeffrey C; Gesing, Sandra; Brandt, Steven R; Howison, James; Dubey, Anshu; Phillips, James C; Wan, Hui; Turk, Matthew J.

F1000Res ; 9: 1192, 2020.

Article in English | MEDLINE | ID: mdl-33214878

ABSTRACT

Background: Software is now ubiquitous within research. In addition to the general challenges common to all software development projects, research software must also represent, manipulate, and provide data for complex theoretical constructs. Ensuring this process of theory-software translation is robust is essential to maintaining the integrity of the science resulting from it, and yet there has been little formal recognition or exploration of the challenges associated with it. Methods: We thematically analyse the outputs of the discussion sessions at the Theory-Software Translation Workshop 2019, where academic researchers and research software engineers from a variety of domains, and with particular expertise in high performance computing, explored the process of translating between scientific theory and software. Results: We identify a wide range of challenges to implementing scientific theory in research software and using the resulting data and models for the advancement of knowledge. We categorise these within the emergent themes of design, infrastructure, and culture, and map them to associated research questions. Conclusions: Systematically investigating how software is constructed and its outputs used within science has the potential to improve the robustness of research software and accelerate progress in its development. We propose that this issue be examined within a new research area of theory-software translation, which would aim to significantly advance both knowledge and scientific practice.

Subject(s)

Computing Methodologies , Software , Engineering , Humans , Knowledge , Research Personnel

18.

Urban-scale variation in pollen concentrations: A single station is insufficient to characterize daily exposure.

Katz, Daniel S W; Batterman, Stuart A.

Aerobiologia (Bologna) ; 36(3): 417-431, 2020 Sep.

Article in English | MEDLINE | ID: mdl-33456131

ABSTRACT

Epidemiological analyses of airborne allergenic pollen often use concentration measurements from a single station to represent exposure across a city, but this approach does not account for the spatial variation of concentrations within the city. Because there are few descriptions of urban-scale variation, the resulting exposure measurement error is unknown but potentially important for epidemiological studies. This study examines urban scale variation in pollen concentrations by measuring pollen concentrations of 13 taxa over 24-hr periods twice weekly at 25 sites in two seasons in Detroit, Michigan. Spatio-temporal variation is described using cumulative distribution functions and regression models. Daily pollen concentrations across the 25 stations varied considerably, and the average quartile coefficient of dispersion was 0.63. Measurements at a single site explained 3-85% of the variation at other sites, depending on the taxon, and 95% prediction intervals of pollen concentrations generally spanned one to two orders of magnitude. These results demonstrate considerable heterogeneity of pollen levels at the urban scale, and suggest that the use of a single monitoring site will not reflect pollen exposure over an urban area and can lead to sizable measurement error in epidemiological studies, particularly when a daily time-step is used. These errors might be reduced by using predictive daily pollen levels in models that combine vegetation maps, pollen production estimates, phenology models and dispersion processes, or by using coarser time-steps in the epidemiological analysis.
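
The abstract above summarizes between-site variability with the quartile coefficient of dispersion (QCD), defined as (Q3 - Q1) / (Q3 + Q1). A minimal sketch of that calculation follows; the function name, the choice of quantile method, and the concentration values are assumptions made for illustration, not details from the study.

```python
from statistics import quantiles


def quartile_coefficient_of_dispersion(values):
    """Return (Q3 - Q1) / (Q3 + Q1) for a sequence of non-negative values."""
    # "inclusive" treats the data as the full population of sites; the
    # quantile definition used in the study is not stated in the abstract.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    if q1 + q3 == 0:
        raise ValueError("QCD is undefined when Q1 + Q3 is zero")
    return (q3 - q1) / (q3 + q1)


# Hypothetical pollen concentrations (grains per cubic metre) measured at
# five sites on the same day.
same_day = [10, 20, 30, 40, 50]
qcd = quartile_coefficient_of_dispersion(same_day)
print(f"QCD = {qcd:.2f}")
```

Because it is built from quartiles, the QCD is unitless and less sensitive to extreme values than the coefficient of variation, which suits skewed concentration data.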

19.

Recognizing the value of software: a software citation guide.

Katz, Daniel S; Chue Hong, Neil P; Clark, Tim; Muench, August; Stall, Shelley; Bouquin, Daina; Cannon, Matthew; Edmunds, Scott; Faez, Telli; Feeney, Patricia; Fenner, Martin; Friedman, Michael; Grenier, Gerry; Harrison, Melissa; Heber, Joerg; Leary, Adam; MacCallum, Catriona; Murray, Hollydawn; Pastrana, Erika; Perry, Katherine; Schuster, Douglas; Stockhause, Martina; Yeston, Jake.

F1000Res ; 9: 1257, 2020.

Article in English | MEDLINE | ID: mdl-33500780

ABSTRACT

Software is as integral as a research paper, monograph, or dataset in terms of facilitating the full understanding and dissemination of research. This article provides broadly applicable guidance on software citation for the communities and institutions publishing academic journals and conference proceedings. We expect those communities and institutions to produce versions of this document with software examples and citation styles that are appropriate for their intended audience. This article (and those community-specific versions) are aimed at authors citing software, including software developed by the authors or by others. We also include brief instructions on how software can be made citable, directing readers to more comprehensive guidance published elsewhere. The guidance presented in this article helps to support proper attribution and credit, reproducibility, collaboration and reuse, and encourages building on the work of others to further research.

Subject(s)

Bibliometrics , Publishing , Reproducibility of Results , Software

20.

Managing genomic variant calling workflows with Swift/T.

Ahmed, Azza E; Heldenbrand, Jacob; Asmann, Yan; Fadlelmola, Faisal M; Katz, Daniel S; Kendig, Katherine; Kendzior, Matthew C; Li, Tiffany; Ren, Yingxue; Rodriguez, Elliott; Weber, Matthew R; Wozniak, Justin M; Zermeno, Jennie; Mainzer, Liudmila S.

PLoS One ; 14(7): e0211608, 2019.

Article in English | MEDLINE | ID: mdl-31287816

ABSTRACT

Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet the "best" workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such a role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at-scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), thus a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes it easy to maintain and to request resources optimal for each stage of the pipeline. While Swift/T's data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures. The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.

Subject(s)

Computational Biology , Genomics , Software , Animals , Humans , Workflow
